Building a Simple Linear Model to Better Understand Regression Methods
Get in Loser, We’re Fitting Lines to Data
model <- lm(exam_score ~ hours_studied, data = df)
df <- df |>
  mutate(
    fitted = model$fitted.values,
    residual = model$residuals
  )
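In symbols, the two columns added above are the fitted value \(\hat{y}_i\) (the model's prediction for observation \(i\)) and the residual, the gap between what we observed and what the model predicts:

\[e_i = y_i - \hat{y}_i\]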
df |>
  select(hours_studied, exam_score, fitted, residual) |>
  janitor::clean_names(case = "title") |>
  slice_sample(n = 10) |>
  gt() |>
  fmt_number(columns = c(Fitted, Residual), decimals = 2) |>
  cols_align(align = "center", columns = everything())

| Hours Studied | Exam Score | Fitted | Residual |
|---|---|---|---|
| 8 | 61 | 63.58 | −2.58 |
| 17 | 65 | 66.20 | −1.20 |
| 20 | 68 | 67.08 | 0.92 |
| 15 | 66 | 65.62 | 0.38 |
| 23 | 70 | 67.95 | 2.05 |
| 21 | 70 | 67.37 | 2.63 |
| 22 | 67 | 67.66 | −0.66 |
| 9 | 65 | 63.87 | 1.13 |
| 27 | 67 | 69.12 | −2.12 |
| 11 | 64 | 64.45 | −0.45 |
df |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")

df |>
  slice_sample(n = 10) |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")

df |>
  slice_sample(n = 100) |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")

df |>
  ggplot(aes(x = hours_studied, y = exam_score)) +
  geom_segment(
    aes(
      x = hours_studied, y = fitted,
      xend = hours_studied, yend = fitted + residual
    ),
    linewidth = 1, colour = "#ED8B00"
  ) +
  geom_point(shape = 21, fill = "white", size = 1.5, stroke = 1) +
  geom_line(
    aes(x = hours_studied, y = fitted),
    linewidth = 1, colour = "#005EB8"
  ) +
  labs(x = "Hours Studied", y = "Exam Score")

\(\text{Residual Sum of Squares (RSS)} = \sum_{i=1}^n \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2\)
Minimising RSS gives us the line that best fits the data, but we don’t know what \(\beta_0\) or \(\beta_1\) are!
Minimise RSS (Solve for \(\beta_0\), then \(\beta_1\))
Our good friends \(\beta_1\), \(\beta_0\), and \(\epsilon\).
\[Y = \beta_0 + \beta_1 X + \epsilon\]
\[\beta_1 = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \]
\[\beta_0 = \bar{y} - \beta_1 \bar{x} \]
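For completeness, these estimators come from calculus: take the partial derivative of the RSS with respect to each coefficient and set it to zero (a standard least-squares step, sketched here):

\[\frac{\partial \text{RSS}}{\partial \beta_0} = -2 \sum_{i=1}^n \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \quad \Rightarrow \quad \beta_0 = \bar{y} - \beta_1 \bar{x}\]

\[\frac{\partial \text{RSS}}{\partial \beta_1} = -2 \sum_{i=1}^n x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0 \quad \Rightarrow \quad \beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}\]

Substituting the first result into the second equation is what eliminates \(\beta_0\) and yields the covariance-over-variance form.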
\[\hat{y}_i = \beta_0 + \beta_1 x_i \]
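The slides' code is in R, but the closed-form estimators are easy to verify in any language. Here is a minimal from-scratch sketch in Python, using made-up toy data (the function name and data are mine, purely for illustration):

```python
# Ordinary least squares "from scratch" for a single predictor:
#   beta1 = Cov(x, y) / Var(x),  beta0 = ybar - beta1 * xbar
def ols_fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = ybar - beta1 * xbar
    return beta0, beta1

# Toy data generated as y = 3 + 2x with no noise,
# so the fit should recover the coefficients exactly.
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]
beta0, beta1 = ols_fit(x, y)
print(beta0, beta1)  # 3.0 2.0
```

Applied to the same data, this should match `coef(lm(exam_score ~ hours_studied, data = df))` in R up to floating-point error.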
I’ve Got All These Variables, What if I Just Regressed Them?
Contact:
Code & Slides:
Paul Johnson // Linear Regression from Scratch // Nov 28, 2024